Divide and Conquer: Crowdsourcing the Creation of Cross-Lingual Textual Entailment Corpora
نویسندگان
چکیده
We address the creation of cross-lingual textual entailment corpora by means of crowdsourcing. Our goal is to define a cheap and replicable data collection methodology that minimizes the manual work done by expert annotators, without resorting to preprocessing tools or already annotated monolingual datasets. In line with recent works emphasizing the need of large-scale annotation efforts for textual entailment, our work aims to: i) tackle the scarcity of data available to train and evaluate systems, and ii) promote the recourse to crowdsourcing as an effective way to reduce the costs of data collection without sacrificing quality. We show that a complex data creation task, for which even experts usually feature low agreement scores, can be effectively decomposed into simple subtasks assigned to non-expert annotators. The resulting dataset, obtained from a pipeline of different jobs routed to Amazon Mechanical Turk, contains more than 1,600 aligned pairs for each combination of texts-hypotheses in English, Italian and German.
منابع مشابه
Creating a Bi-lingual Entailment Corpus through Translations with Mechanical Turk: $100 for a 10-day Rush
This paper reports on experiments in the creation of a bi-lingual Textual Entailment corpus, using non-experts’ workforce under strict cost and time limitations ($100, 10 days). To this aim workers have been hired for translation and validation tasks, through the CrowdFlower channel to Amazon Mechanical Turk. As a result, an accurate and reliable corpus of 426 English/Spanish entailment pairs h...
متن کاملDetecting Cross-Lingual Semantic Divergence for Neural Machine Translation
Parallel corpora are often not as parallel as one might assume: non-literal translations and noisy translations abound, even in curated corpora routinely used for training and evaluation. We use a cross-lingual textual entailment system to distinguish sentence pairs that are parallel in meaning from those that are not, and show that filtering out divergent examples from training improves transl...
متن کاملSemeval-2013 Task 8: Cross-lingual Textual Entailment for Content Synchronization
This paper presents the second round of the task on Cross-lingual Textual Entailment for Content Synchronization, organized within SemEval-2013. The task was designed to promote research on semantic inference over texts written in different languages, targeting at the same time a real application scenario. Participants were presented with datasets for different language pairs, where multi-direc...
متن کاملSemeval-2012 Task 8: Cross-lingual Textual Entailment for Content Synchronization
This paper presents the first round of the task on Cross-lingual Textual Entailment for Content Synchronization, organized within SemEval-2012. The task was designed to promote research on semantic inference over texts written in different languages, targeting at the same time a real application scenario. Participants were presented with datasets for different language pairs, where multi-direct...
متن کاملUsing Bilingual Parallel Corpora for Cross-Lingual Textual Entailment
This paper explores the use of bilingual parallel corpora as a source of lexical knowledge for cross-lingual textual entailment. We claim that, in spite of the inherent difficulties of the task, phrase tables extracted from parallel data allow to capture both lexical relations between single words, and contextual information useful for inference. We experiment with a phrasal matching method in ...
متن کامل